Process Documents from Mail Store (Text Processing)

Synopsis

Generates word vectors from a text collection stored in an IMAP or POP3 mail server.
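
As a rough illustration of what "word vectors" means here, the sketch below builds simple term-occurrence vectors from a few in-memory strings. It is not the operator's implementation: the regex tokenizer, the function names `tokenize` and `build_vectors`, and the sample texts are all assumptions made for illustration, and the mail retrieval step is left out entirely.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract alphabetic tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vectors(documents):
    """Return (word_list, vectors); each vector holds per-document term counts."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The word list is the sorted union of all tokens seen in any document.
    word_list = sorted(set().union(*counts))
    vectors = [[c[word] for word in word_list] for c in counts]
    return word_list, vectors

# Hypothetical message bodies standing in for mails read from the store.
mails = ["Meeting moved to Monday", "Monday meeting cancelled", "Invoice attached"]
word_list, vectors = build_vectors(mails)
```

Each row of `vectors` corresponds to one message and each column to one entry of `word_list`, which mirrors the relationship between the example set and word list outputs described below. Other weighting schemas (for example TF-IDF) would change the cell values but not this layout.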

Input

  • word list

    The word list port.

  • connection (Connection)

    This port can take a connection of type Mail (retrieve).

Output

  • example set (Data Table)

    The example set port.

  • word list

    The word list port.

  • connection (Connection)

    If the input port connection has data, it will be put through to this output port.

Parameters

  • mail_account: The mail connection used to retrieve the email. Only visible if the connection input port is not connected and the compatibility level is above 9.3.1.
  • create_word_vector: If checked, the tokens of a document are used to generate a vector that numerically represents the document.
  • vector_creation: Selects the schema for creating the word vector.
  • add_meta_information: If checked, available meta information of the text, such as filename and date, is added as attributes.
  • keep_text: If checked, the input text is stored as a special String attribute with the role text.
  • prune_method: Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.
  • prune_below_percent: Ignore words that appear in less than this percentage of all documents.
  • prune_above_percent: Ignore words that appear in more than this percentage of all documents.
  • prune_below_absolute: Ignore words that appear in fewer than this many documents.
  • prune_above_absolute: Ignore words that appear in more than this many documents.
  • prune_below_rank: Words are ordered by frequency, and words with a frequency lower than the frequency of the rank given by this percentage are pruned.
  • prune_above_rank: Words are ordered by frequency, and words with a frequency higher than the frequency of the rank given by this percentage are pruned.
  • datamanagement: Determines how the data is represented internally.
  • define_store: The mail store connection can be defined either by using a session bound to a JNDI name, or explicitly by specifying host and user.
  • jndi_name: JNDI name referencing a mail session.
  • host: IMAP or POP3 host name.
  • user: IMAP or POP3 user name.
  • password: IMAP or POP3 password.
  • connection_properties: Additional properties for the mail store.
  • protocol: IMAP or POP3.
  • only_unseen: If checked, only new, unseen messages are processed.
  • mark_seen: If checked, all processed messages are marked as read. Only works with IMAP, not with POP3.
  • delete_messages: If checked, all processed messages are deleted. Especially useful for POP3.
  • recursive: Whether to recurse into subfolders.
  • folder: Name of the IMAP folder to scan. Must be INBOX for POP3.
  • download attachments: Select to download mails and attachments.
  • attachment file-pattern: A pattern for the attachments you want to select. The usual wildcards ? and * are supported.
  • attachment MIME-type: The MIME type you want to select. If this label and all additional labels are empty, all MIME types are selected.
  • parallelize_vector_creation: Determines whether the execution of vector creation should be parallelized.
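
The percentage-based pruning parameters can be pictured with a small sketch. It assumes documents are already tokenized, treats both thresholds as inclusive, and uses a hypothetical function name `prune_by_percent`. None of this is taken from the operator's source, so the exact boundary behavior may differ.

```python
def prune_by_percent(doc_tokens, below_percent, above_percent):
    """Keep words whose document frequency, in percent, lies within the two thresholds.

    Assumption: both thresholds are inclusive; the real operator may differ.
    """
    n_docs = len(doc_tokens)
    doc_freq = {}
    for tokens in doc_tokens:
        for word in set(tokens):  # count each word at most once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    kept = []
    for word, df in doc_freq.items():
        percent = 100.0 * df / n_docs
        if below_percent <= percent <= above_percent:
            kept.append(word)
    return sorted(kept)

# Hypothetical tokenized mails.
docs = [
    ["monday", "meeting"],
    ["monday", "invoice"],
    ["monday", "meeting", "agenda"],
    ["monday", "report"],
]
kept = prune_by_percent(docs, below_percent=30.0, above_percent=90.0)
```

With these sample documents, "monday" appears in every document and exceeds the upper threshold, while "invoice", "agenda" and "report" each appear in only one of four documents and fall below the lower threshold, so only "meeting" remains in `kept`. The absolute variants would compare the raw document count `df` against the thresholds instead of the percentage.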